This report was prepared for ETC5521 Assignment 2 by Team Cassowary, comprising Sahinya Akila and Xinrui Wang.
Gender and race have long been concerns in the workplace, and examining them helps to assess fairness, equality and diversity. This motivated us to conduct a detailed analysis of employment and earnings across industries in the USA, using data from tidytuesday.
However, the data have limitations that affect the outcome of the analysis: they contain inadequate information about regions, and the two datasets cover different time frames.
The datasets originally come from the Bureau of Labor Statistics (BLS), specifically table cpsaat17 across several years.
The employed dataset describes employed persons by industry, sex, race and occupation from 2015 to 2020.

| Column_name | data_type | description |
|---|---|---|
| industry | character | Industry Group |
| major_occupation | character | Major occupation category |
| minor_occupation | character | Minor occupation category |
| race_gender | character | Race & Gender wise information |
| industry_total | double | Industry total count |
| employ_n | double | Number of people employed |
| year | double | Year |
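
The earn dataset records the number of employed persons and their median weekly earnings by sex, race, ethnic origin and age group, by quarter, from 2010 to 2020.

| Column_name | data_type | description |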
|---|---|---|
| sex | character | Gender |
| race | character | Racial group |
| ethnic_origin | character | Ethnic origin (hispanic or non-hispanic) |
| age | character | Age group |
| year | double | Year |
| quarter | double | Quarter |
| n_persons | double | Number of persons employed by group |
| median_weekly_earn | double | Median weekly earning in current dollars |

The datasets are collected from the Current Population Survey (CPS), which is a monthly survey of households conducted by the Bureau of Census for the Bureau of Labor Statistics.
The following summarises how the data from the original source were tidied and wrangled:
employed data
The raw data is in Excel format. The tidytuesday author first cleaned a single year as an example, using functions such as slice() and rename() so that the column titles and values of the original table display clearly and properly. pivot_longer() was then used so that each variable corresponds to one column. After that, the author removed redundant characters with regular expressions and selected the required data, which completes the cleaning for a given year. The next step is to wrap these steps in a function and apply it to every year so that all years can be combined. The tidy data should still be checked, for example with a simple ggplot2 plot. Finally, the data can be written out with write_csv().
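The steps described above can be sketched as follows. This is a minimal illustration, not the tidytuesday author's actual script: `make_raw_year()` is a hypothetical stand-in for reading one year's spreadsheet after its header rows have been removed, the values are invented, and the trailing digits on the industry names are assumed to be footnote markers.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for one year's wide table (invented values).
# The trailing digits mimic footnote markers left over from the spreadsheet.
make_raw_year <- function(yr) {
  tibble(
    industry = c("Construction 1", "Manufacturing 2"),
    Men      = c(90, 70) + (yr - 2015),
    Women    = c(10, 30) + (yr - 2015)
  )
}

# Tidy one year: one row per industry and group, redundant characters removed.
clean_year <- function(yr) {
  make_raw_year(yr) %>%
    pivot_longer(
      cols      = c(Men, Women),
      names_to  = "race_gender",
      values_to = "employ_n"
    ) %>%
    mutate(
      industry = sub("\\s*\\d+$", "", industry),  # strip trailing footnote digits
      year     = yr
    )
}

# Apply the same function to every year and stack the results.
employed_tidy <- bind_rows(lapply(2015:2020, clean_year))
```

Wrapping the single-year steps in a function keeps the cleaning logic in one place, so a fix made for one year applies to all of them.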
earn data
The raw data is in Excel format. The author converted it to a table using html_nodes() and html_table(). As with the employed data, a function was created, and the data were combined using bind_rows() and left_join(). The final cleaned data were then obtained with basic tidying functions such as filter(), select() and mutate(). Last but not least, the data can be checked and written out.
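As a rough sketch of the combining step, the example below stacks per-group tables with bind_rows() and then attaches earnings to counts with left_join(). The small tables and their values are invented, and the assumption that counts and earnings arrive as separate tables keyed by sex and year is ours, not something confirmed by the original script.

```r
library(dplyr)

# Invented per-group tables, standing in for the output of a scraping helper.
counts <- bind_rows(
  tibble(sex = "Men",   year = c(2019, 2020), n_persons = c(60, 55)),
  tibble(sex = "Women", year = c(2019, 2020), n_persons = c(50, 46))
)

earnings <- bind_rows(
  tibble(sex = "Men",   year = c(2019, 2020), median_weekly_earn = c(1000, 1050)),
  tibble(sex = "Women", year = c(2019, 2020), median_weekly_earn = c(820, 880))
)

# Join earnings onto counts, then apply basic tidying verbs.
earn_tidy <- counts %>%
  left_join(earnings, by = c("sex", "year")) %>%
  filter(year == 2020) %>%
  select(sex, year, n_persons, median_weekly_earn)
```

left_join() keeps every row of the counts table, so a group with no matching earnings would show up with a missing value rather than being dropped silently.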
Based on the datasets, three questions are explored and analysed in the following section:
How did the number of people employed in different industries change from 2015 to 2020?
What are the demographic differences between industries from 2015 to 2020?
How do different factors affect income between 2010 and 2020?
First of all, the graph shown above indicates how the number of employees in each industry changed over the five years from 2015 to 2020. The largest number of people work in education and health services, where employment remained stable at between 34 million and 35 million throughout the period. Private households employ the fewest people, and employment in this industry decreased from around 0.7 million to 0.6 million. In addition, employment fell from 2019 to 2020 in every industry except public administration.
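A comparison of this kind can be computed along the following lines. The sketch uses a small invented table with the same column names as the employed dataset; the assumption that rows with `race_gender == "TOTAL"` hold the overall counts should be checked against the real data before reuse.

```r
library(dplyr)
library(tidyr)

# Invented data in the layout of the employed dataset (values are not real).
employed_toy <- tibble(
  industry         = rep(c("Education and health services",
                           "Public administration"), each = 4),
  minor_occupation = rep(c("A", "B"), times = 4),
  race_gender      = "TOTAL",
  year             = rep(rep(c(2019, 2020), each = 2), times = 2),
  employ_n         = c(20, 15, 19, 14, 4, 3, 4, 4)
)

# Total employment per industry and year, summed over occupations.
industry_year <- employed_toy %>%
  filter(race_gender == "TOTAL") %>%
  group_by(industry, year) %>%
  summarise(employed = sum(employ_n, na.rm = TRUE), .groups = "drop")

# Put the two years side by side and compute the change.
change_2019_2020 <- industry_year %>%
  pivot_wider(names_from = year, values_from = employed, names_prefix = "y") %>%
  mutate(change = y2020 - y2019)
```

Filtering to a single `race_gender` level before summing matters because the dataset repeats the same workers under several groupings, and adding them together would double count.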
In the analysis of gender across industries, only five industries have more female than male employees: education and health services, financial activities, leisure and hospitality, other services, and private households. In education and health services in particular, there are more than twice as many female employees as male employees. In contrast, men hold most of the roles in industries such as manufacturing, construction, transportation and utilities, and durable goods. In construction, more than 90% of employees are male.
Looking at the relationship between industry and employment by race, most of the people employed in every industry are White, followed by Black or African American and Asian.
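The gender comparison above amounts to computing the share of women in each industry. The sketch below shows one way to do this on invented data; the group labels "Men" and "Women" in `race_gender` are assumed to match those in the employed dataset.

```r
library(dplyr)

# Invented counts in the layout of the employed dataset (values are not real).
gender_toy <- tibble(
  industry    = rep(c("Construction", "Education and health services"), each = 2),
  race_gender = rep(c("Men", "Women"), times = 2),
  employ_n    = c(90, 10, 30, 70)
)

# Share of women among men and women combined, per industry.
female_share <- gender_toy %>%
  filter(race_gender %in% c("Men", "Women")) %>%
  group_by(industry) %>%
  summarise(
    share_women = sum(employ_n[race_gender == "Women"]) / sum(employ_n),
    .groups = "drop"
  ) %>%
  mutate(majority = if_else(share_women > 0.5, "Women", "Men"))
```

Working with shares rather than raw counts makes industries of very different sizes directly comparable.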
The figure shows that, for both men and women, more people are employed between the ages of 16 and 54 than in other age groups. It is also evident that there are more male than female employees. Employment peaks in the 25-54 age group, when people have finished their education and started their careers; for most people these are also the prime working years. As expected, the numbers then decline gradually after age 55 as people begin to retire.
Turning to the earnings data, median weekly income varies across genders, races, ethnic origins and age groups.
The plot above shows a clear relationship between gender, race and weekly income over the years. There is an obvious upward trend in income year by year. The upper end of each segment represents men's income and the lower end represents women's, which clearly shows that men generally earn more than women. The plot also indicates that race is strongly associated with income: Asian workers earn the most, followed by White workers, while Black or African American workers earn the least. The data alone cannot explain these gaps. Differences in hours worked, occupation and education may contribute, and racial discrimination may also play a role.
The figure above shows income levels for different age groups. The Y-axis is marked at the minimum, first quartile, median, third quartile and maximum of income across the whole dataset. The interactive plot shows that young adults earn much less than middle-aged people, and that there is little difference between the age groups over 35.
Since the data break out ethnic origin only for Hispanics or Latinos, a minority group in the United States, we plot their income separately to see whether it differs.
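The five reference values mentioned above, and the same summary per age group, can be obtained roughly as follows. The data here are invented, and the age labels are only assumed to resemble those in the earn dataset.

```r
library(dplyr)

# Invented earnings in the layout of the earn dataset (values are not real).
earn_toy <- tibble(
  age = rep(c("16 to 24 years", "35 to 44 years"), each = 5),
  median_weekly_earn = c(400, 450, 500, 550, 600,
                         900, 950, 1000, 1050, 1100)
)

# Five-number summary over all rows: min, Q1, median, Q3, max.
overall_breaks <- quantile(earn_toy$median_weekly_earn,
                           probs = c(0, 0.25, 0.5, 0.75, 1),
                           names = FALSE)

# The same summary computed separately for each age group.
age_summary <- earn_toy %>%
  group_by(age) %>%
  summarise(
    minimum = min(median_weekly_earn),
    q1      = quantile(median_weekly_earn, 0.25, names = FALSE),
    med     = median(median_weekly_earn),
    q3      = quantile(median_weekly_earn, 0.75, names = FALSE),
    maximum = max(median_weekly_earn),
    .groups = "drop"
  )
```

Comparing each group's median against the overall quartiles gives a quick numerical check on what the plot suggests visually.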
Median weekly earn of Hispanics or Latinos
Compared with the median weekly income of the total population, the boxplot shows lower income for Hispanics or Latinos, lower even than for Black or African American workers. Possible reasons, which this dataset cannot confirm, include differences in education level, personal circumstances, government support and employment discrimination.
Combining the two analyses of the employment situation and income levels in the United States, some preliminary conclusions can be drawn:
In terms of gender, the industries that are overwhelmingly male, such as construction and manufacturing, are often associated with physical labor and technical trades, while women outnumber men in service-oriented industries such as education and health services. Men also generally earn more than women. One possible explanation is that differences in occupation and in technical or professional skill requirements contribute to the income gap, although the data do not allow this to be tested.
In terms of race, far more White people are employed than people of other races. On the one hand, the White population of the United States is much larger than the Asian population; on the other hand, employment discrimination may affect Black or African American workers in particular. In terms of income, however, Asian workers earn the most. A possible explanation is that many Asian immigrants to the United States, including overseas students, are highly educated or highly skilled, so their income is relatively high. Black and Hispanic or Latino workers earn less, possibly because of employment discrimination or unequal access to education and higher-paying occupations; this dataset cannot distinguish between such causes.
In terms of age, middle-aged people earn the most, considerably more than young adults, which is to be expected. This suggests that work experience is highly valued in the United States and that prospects for promotion and salary increases are relatively stable, which may to some extent reflect stable corporate policies.
The decline in U.S. employment from 2019 to 2020 is likely due to an increase in layoffs during the economic downturn, possibly caused by COVID-19.
Labor Force Statistics from the Current Population Survey. (2021). Retrieved 15 August 2021, from https://www.bls.gov/cps/tables.htm#charemp_m
Tidytuesday. (2021). Retrieved 15 August 2021, from https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-02-23/readme.md
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https://CRAN.R-project.org/package=dplyr
Hadley Wickham (2021). tidyr: Tidy Messy Data. R package version 1.1.3. https://CRAN.R-project.org/package=tidyr
Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra
JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.10. URL https://rmarkdown.rstudio.com.
Kirill Müller and Hadley Wickham (2021). tibble: Simple Data Frames. R package version 3.1.3. https://CRAN.R-project.org/package=tibble
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Wilke, C.O. (2020). ggtext: Improved Text Rendering Support for ‘ggplot2’. R package version 0.1.1. https://CRAN.R-project.org/package=ggtext
Yihui Xie and J.J. Allaire and Garrett Grolemund (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. URL https://bookdown.org/yihui/rmarkdown.
Yihui Xie and Christophe Dervieux and Emily Riederer (2020). R Markdown Cookbook. Chapman and Hall/CRC. ISBN 9780367563837. URL https://bookdown.org/yihui/rmarkdown-cookbook.